Robust Keyword Retrieval Method for OCRed Text
Internal identifier: 000547 (Main/Exploration); previous: 000546; next: 000548
Authors: Yusaku Fujii [Japan]; Hiroaki Takebe [Japan]; Hiroshi Tanaka [Japan]; Yoshinobu Hotta [Japan]
Source:
- Proceedings of SPIE, the International Society for Optical Engineering [ 0277-786X ] ; 2011.
French descriptors
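- Pascal (Inist) : Imagerie, Mot clé, Reconnaissance optique caractère, Gestion document, Document électronique, Recherche documentaire, Recherche information, Reconnaissance caractère, Segmentation, Robustesse, 0130C, 4230.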
- Wicri :
- topic : Document électronique, Recherche documentaire.
English descriptors
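- KwdEn : Character recognition, Document management, Document retrieval, Electronic document, Imagery, Information retrieval, Keyword, Optical character recognition, Robustness, Segmentation.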
Abstract
Document management systems have become important because of the growing popularity of electronic filing of documents and scanning of books, magazines, manuals, etc., through a scanner or a digital camera, for storage or reading on a PC or an electronic book. Text information acquired by optical character recognition (OCR) is usually added to the electronic documents for document retrieval. Since texts generated by OCR generally include character recognition errors, robust retrieval methods have been introduced to overcome this problem. In this paper, we propose a retrieval method that is robust against both character segmentation and recognition errors. In the proposed method, the insertion of noise characters and dropping of characters in the keyword retrieval enables robustness against character segmentation errors, and character substitution in the keyword of the recognition candidate for each character in OCR or any other character enables robustness against character recognition errors. The recall rate of the proposed method was 15% higher than that of the conventional method. However, the precision rate was 64% lower.
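The three kinds of tolerance described in the abstract (noise characters inserted into the text, characters dropped from it, and substitution by a recognition candidate) can be illustrated with a small sketch. This is not the authors' implementation: it swaps in standard approximate-substring matching by dynamic programming, and the function names, the `candidates` data layout, and the `max_errors` threshold are all assumptions made for illustration.

```python
from typing import List, Sequence


def top1(text: str) -> List[List[str]]:
    """Hypothetical helper: wrap plain OCR output as one candidate per position."""
    return [[ch] for ch in text]


def fuzzy_find(keyword: str,
               candidates: Sequence[Sequence[str]],
               max_errors: int = 1) -> List[int]:
    """Return exclusive end positions where `keyword` matches the OCR text
    with at most `max_errors` edits.

    candidates[j] holds the recognition candidates for OCR position j
    (assumed layout, top-ranked character first).
    """
    m = len(keyword)
    # prev[i]: cheapest way to align keyword[:i] with text ending at the previous position.
    prev = list(range(m + 1))
    hits = []
    for j, cands in enumerate(candidates):
        cur = [0] * (m + 1)  # cur[0] == 0 lets a match start at any position
        for i in range(1, m + 1):
            # Free substitution when the keyword character is among the candidates.
            sub = prev[i - 1] + (0 if keyword[i - 1] in cands else 1)
            noise = prev[i] + 1     # extra character in the OCR text (segmentation noise)
            drop = cur[i - 1] + 1   # keyword character missing from the OCR text
            cur[i] = min(sub, noise, drop)
        if cur[m] <= max_errors:
            hits.append(j + 1)
        prev = cur
    return hits


if __name__ == "__main__":
    # "retrieval" misread as "retrleval", with 'i' kept as a lower-ranked candidate.
    ocr = top1("retrleval")
    ocr[4] = ["l", "i", "1"]
    print(fuzzy_find("retrieval", ocr, max_errors=0))
    # A noise character and a dropped character, each tolerated with one edit.
    print(fuzzy_find("keyword", top1("key:word"), max_errors=1))
    print(fuzzy_find("keyword", top1("keyord"), max_errors=1))
```

In this sketch, raising `max_errors` admits more matches, which tends to raise recall while letting in more false hits; that is the same direction of trade-off the abstract reports between recall and precision.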
Affiliations:
Links to previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: 000132
- to stream PascalFrancis, to step Curation: 000641
- to stream PascalFrancis, to step Checkpoint: 000101
- to stream Main, to step Merge: 000553
- to stream Main, to step Curation: 000547
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Robust Keyword Retrieval Method for OCRed Text</title>
<author><name sortKey="Fujii, Yusaku" sort="Fujii, Yusaku" uniqKey="Fujii Y" first="Yusaku" last="Fujii">Yusaku Fujii</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>FUJITSU LABORATORIES LTD., 1-1 Kamikodanaka 4-chome</s1>
<s2>Nakahara-ku, Kawasaki</s2>
<s3>JPN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>Japon</country>
<wicri:noRegion>Nakahara-ku, Kawasaki</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Takebe, Hiroaki" sort="Takebe, Hiroaki" uniqKey="Takebe H" first="Hiroaki" last="Takebe">Hiroaki Takebe</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>FUJITSU LABORATORIES LTD., 1-1 Kamikodanaka 4-chome</s1>
<s2>Nakahara-ku, Kawasaki</s2>
<s3>JPN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>Japon</country>
<wicri:noRegion>Nakahara-ku, Kawasaki</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Tanaka, Hiroshi" sort="Tanaka, Hiroshi" uniqKey="Tanaka H" first="Hiroshi" last="Tanaka">Hiroshi Tanaka</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>FUJITSU LABORATORIES LTD., 1-1 Kamikodanaka 4-chome</s1>
<s2>Nakahara-ku, Kawasaki</s2>
<s3>JPN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>Japon</country>
<wicri:noRegion>Nakahara-ku, Kawasaki</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Hotta, Yoshinobu" sort="Hotta, Yoshinobu" uniqKey="Hotta Y" first="Yoshinobu" last="Hotta">Yoshinobu Hotta</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>FUJITSU LABORATORIES LTD., 1-1 Kamikodanaka 4-chome</s1>
<s2>Nakahara-ku, Kawasaki</s2>
<s3>JPN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>Japon</country>
<wicri:noRegion>Nakahara-ku, Kawasaki</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">11-0279163</idno>
<date when="2011">2011</date>
<idno type="stanalyst">PASCAL 11-0279163 INIST</idno>
<idno type="RBID">Pascal:11-0279163</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000132</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000641</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000101</idno>
<idno type="wicri:doubleKey">0277-786X:2011:Fujii Y:robust:keyword:retrieval</idno>
<idno type="wicri:Area/Main/Merge">000553</idno>
<idno type="wicri:Area/Main/Curation">000547</idno>
<idno type="wicri:Area/Main/Exploration">000547</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Robust Keyword Retrieval Method for OCRed Text</title>
<author><name sortKey="Fujii, Yusaku" sort="Fujii, Yusaku" uniqKey="Fujii Y" first="Yusaku" last="Fujii">Yusaku Fujii</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>FUJITSU LABORATORIES LTD., 1-1 Kamikodanaka 4-chome</s1>
<s2>Nakahara-ku, Kawasaki</s2>
<s3>JPN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>Japon</country>
<wicri:noRegion>Nakahara-ku, Kawasaki</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Takebe, Hiroaki" sort="Takebe, Hiroaki" uniqKey="Takebe H" first="Hiroaki" last="Takebe">Hiroaki Takebe</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>FUJITSU LABORATORIES LTD., 1-1 Kamikodanaka 4-chome</s1>
<s2>Nakahara-ku, Kawasaki</s2>
<s3>JPN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>Japon</country>
<wicri:noRegion>Nakahara-ku, Kawasaki</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Tanaka, Hiroshi" sort="Tanaka, Hiroshi" uniqKey="Tanaka H" first="Hiroshi" last="Tanaka">Hiroshi Tanaka</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>FUJITSU LABORATORIES LTD., 1-1 Kamikodanaka 4-chome</s1>
<s2>Nakahara-ku, Kawasaki</s2>
<s3>JPN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>Japon</country>
<wicri:noRegion>Nakahara-ku, Kawasaki</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Hotta, Yoshinobu" sort="Hotta, Yoshinobu" uniqKey="Hotta Y" first="Yoshinobu" last="Hotta">Yoshinobu Hotta</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>FUJITSU LABORATORIES LTD., 1-1 Kamikodanaka 4-chome</s1>
<s2>Nakahara-ku, Kawasaki</s2>
<s3>JPN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>Japon</country>
<wicri:noRegion>Nakahara-ku, Kawasaki</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">Proceedings of SPIE, the International Society for Optical Engineering</title>
<title level="j" type="abbreviated">Proc. SPIE Int. Soc. Opt. Eng.</title>
<idno type="ISSN">0277-786X</idno>
<imprint><date when="2011">2011</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">Proceedings of SPIE, the International Society for Optical Engineering</title>
<title level="j" type="abbreviated">Proc. SPIE Int. Soc. Opt. Eng.</title>
<idno type="ISSN">0277-786X</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Character recognition</term>
<term>Document management</term>
<term>Document retrieval</term>
<term>Electronic document</term>
<term>Imagery</term>
<term>Information retrieval</term>
<term>Keyword</term>
<term>Optical character recognition</term>
<term>Robustness</term>
<term>Segmentation</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Imagerie</term>
<term>Mot clé</term>
<term>Reconnaissance optique caractère</term>
<term>Gestion document</term>
<term>Document électronique</term>
<term>Recherche documentaire</term>
<term>Recherche information</term>
<term>Reconnaissance caractère</term>
<term>Segmentation</term>
<term>Robustesse</term>
<term>0130C</term>
<term>4230</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Document électronique</term>
<term>Recherche documentaire</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Document management systems have become important because of the growing popularity of electronic filing of documents and scanning of books, magazines, manuals, etc., through a scanner or a digital camera, for storage or reading on a PC or an electronic book. Text information acquired by optical character recognition (OCR) is usually added to the electronic documents for document retrieval. Since texts generated by OCR generally include character recognition errors, robust retrieval methods have been introduced to overcome this problem. In this paper, we propose a retrieval method that is robust against both character segmentation and recognition errors. In the proposed method, the insertion of noise characters and dropping of characters in the keyword retrieval enables robustness against character segmentation errors, and character substitution in the keyword of the recognition candidate for each character in OCR or any other character enables robustness against character recognition errors. The recall rate of the proposed method was 15% higher than that of the conventional method. However, the precision rate was 64% lower.</div>
</front>
</TEI>
<affiliations><list><country><li>Japon</li>
</country>
</list>
<tree><country name="Japon"><noRegion><name sortKey="Fujii, Yusaku" sort="Fujii, Yusaku" uniqKey="Fujii Y" first="Yusaku" last="Fujii">Yusaku Fujii</name>
</noRegion>
<name sortKey="Hotta, Yoshinobu" sort="Hotta, Yoshinobu" uniqKey="Hotta Y" first="Yoshinobu" last="Hotta">Yoshinobu Hotta</name>
<name sortKey="Takebe, Hiroaki" sort="Takebe, Hiroaki" uniqKey="Takebe H" first="Hiroaki" last="Takebe">Hiroaki Takebe</name>
<name sortKey="Tanaka, Hiroshi" sort="Tanaka, Hiroshi" uniqKey="Tanaka H" first="Hiroshi" last="Tanaka">Hiroshi Tanaka</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000547 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000547 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= Pascal:11-0279163 |texte= Robust Keyword Retrieval Method for OCRed Text }}
This area was generated with Dilib version V0.6.32.